[serve][1/n] Introduce gang scheduling #60802
jeffreywang-anyscale wants to merge 7 commits into ray-project:master from
Conversation
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
Code Review
This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The changes are extensive, touching core components like the deployment scheduler and state management, adding new configuration options, and providing a solid suite of tests. My review focuses on a few areas for improvement: enhancing type safety in data classes, improving constant management for better code clarity, making cleanup logic more robust by adding logging, and simplifying some conditional logic. Overall, this is a well-implemented feature.
```python
except Exception:
    pass
```
Catching a broad Exception and passing silently can hide important issues during cleanup. While cleanup should be robust, it's better to at least log the exception to aid in debugging potential problems. For example, if there's a permission issue or a problem with the GCS connection, we would want to know about it.
```diff
-except Exception:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to remove placement group {pg.id}: {e}")
```
```python
gang_placement_strategy=gang_config.gang_placement_strategy.value
if hasattr(gang_config.gang_placement_strategy, "value")
else str(gang_config.gang_placement_strategy),
```
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
537abee to 4b69ebd
/gemini review
Code Review
This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The implementation is comprehensive, touching core components like the deployment scheduler and state manager. New dataclasses for gang context and requests are added, along with configuration options in DeploymentConfig and the @serve.deployment decorator. The logic for reserving placement groups for gangs and handling the RESTART_GANG failure policy seems well-thought-out. The changes are supported by a good set of unit and end-to-end tests. My review includes a few minor suggestions for improving robustness and code clarity.
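Because gang scheduling builds on Ray placement groups, the core "all-or-nothing" reservation can be illustrated with plain Ray APIs. The bundle shape and `STRICT_PACK` strategy below are assumptions for the sketch, not necessarily what the Serve scheduler in this PR chooses:

```python
# Sketch of atomic gang-sized reservation using Ray placement groups.
import ray
from ray.util.placement_group import placement_group

ray.init()

gang_size = 4
replica_resources = {"CPU": 1}

# One bundle per gang replica; Ray reserves all bundles atomically,
# so the gang either gets its full capacity or nothing.
pg = placement_group(
    bundles=[dict(replica_resources) for _ in range(gang_size)],
    strategy="STRICT_PACK",  # assumed; PACK/SPREAD variants also exist
)
ray.get(pg.ready())  # blocks until every bundle has been reserved
```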
```python
if num_replicas % v.gang_size != 0:
    raise ValueError(
        f"num_replicas ({num_replicas}) must be a multiple of "
        f"gang_size ({v.gang_size})."
    )
```
The `num_replicas` value could potentially be `None`, which would cause a `TypeError` when the modulo operator is used. While pydantic's default value handling might prevent this, adding a `num_replicas is not None` check would make this validator more robust against unexpected `None` values.
```diff
-if num_replicas % v.gang_size != 0:
-    raise ValueError(
-        f"num_replicas ({num_replicas}) must be a multiple of "
-        f"gang_size ({v.gang_size})."
-    )
+if num_replicas is not None and num_replicas % v.gang_size != 0:
+    raise ValueError(
+        f"num_replicas ({num_replicas}) must be a multiple of "
+        f"gang_size ({v.gang_size})."
+    )
```
| f"num_replicas_to_add {request.num_replicas_to_add} " | ||
| f"is not divisible by gang_size {gang_size}. " |
There's a trailing space in the f-string for the error message, which should be removed for cleaner output.
| f"num_replicas_to_add {request.num_replicas_to_add} " | |
| f"is not divisible by gang_size {gang_size}. " | |
| f"num_replicas_to_add {request.num_replicas_to_add} " | |
| f"is not divisible by gang_size {gang_size}." |
Signed-off-by: jeffreywang <jeffreywang@anyscale.com>
@cursor review
| f"gang_size ({v.gang_size})." | ||
| ) | ||
|
|
||
| return v |
Validator crashes when num_replicas is None
High Severity
The `validate_gang_scheduling_config` validator performs `num_replicas % v.gang_size` without checking if `num_replicas` is `None`. The `num_replicas` field is `Optional[NonNegativeInt]` and can be `None` when autoscaling is used (e.g. `num_replicas="auto"`). Additionally, in Pydantic v1, if `num_replicas` fails its own validation, it won't be present in `values`, causing `values.get("num_replicas")` to return `None`. In either case, `None % v.gang_size` raises a `TypeError`.
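A hedged sketch of how a defensive pydantic v1 field validator could look once the `None` guard is in place; the model and field subset below are illustrative only, and the real `DeploymentConfig` is much larger:

```python
from typing import Optional

from pydantic import BaseModel, NonNegativeInt, validator


class GangSchedulingConfig(BaseModel):
    gang_size: int
    gang_placement_strategy: str = "PACK"


class DeploymentConfigSketch(BaseModel):
    # Illustrative subset; the real DeploymentConfig has many more fields.
    num_replicas: Optional[NonNegativeInt] = 1
    gang_scheduling_config: Optional[GangSchedulingConfig] = None

    @validator("gang_scheduling_config")
    def validate_gang_scheduling_config(cls, v, values):
        if v is None:
            return v
        # values.get() yields None both when num_replicas is legitimately None
        # (e.g. autoscaling) and when num_replicas failed its own validation,
        # so skip the divisibility check in either case.
        num_replicas = values.get("num_replicas")
        if num_replicas is not None and num_replicas % v.gang_size != 0:
            raise ValueError(
                f"num_replicas ({num_replicas}) must be a multiple of "
                f"gang_size ({v.gang_size})."
            )
        return v
```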
```proto
int32 max_constructor_retry_count = 20;

// Gang scheduling configuration for atomic replica scheduling.
GangSchedulingConfig gang_scheduling_config = 21;
```
Proto file modified requires fault-tolerance review notice
Low Severity
This PR modifies src/ray/protobuf/serve.proto. Per the "RPC Fault Tolerance Standards Guide", changes to .proto files require a fault-tolerance review.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
```python
# Gang PGs are shared across multiple replicas.
# Another replica in the same gang may have already
# removed this PG.
pass
```
Gang PG silently removed while siblings still running
Medium Severity
When a gang replica finishes stopping, `check_stopped` removes the shared placement group. Since gang PGs are shared, the first replica to stop removes the PG while sibling replicas may still be running on it. This is especially problematic during downscaling (which has no gang awareness) — individual replicas can be selected for removal, causing the shared PG to be deleted from under still-active gang members. The broad `except Exception: pass` also silently swallows unrelated errors during PG cleanup.
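One possible direction, sketched here purely for illustration (`GangContext`, its fields, and `on_replica_stopped` are hypothetical names, not this PR's code): only the last stopping member of a gang removes the shared placement group, and failures are logged rather than swallowed.

```python
import logging
from dataclasses import dataclass, field
from typing import Set

from ray.util.placement_group import PlacementGroup, remove_placement_group

logger = logging.getLogger(__name__)


@dataclass
class GangContext:
    # Hypothetical bookkeeping for one gang's shared placement group.
    placement_group: PlacementGroup
    live_replica_ids: Set[str] = field(default_factory=set)


def on_replica_stopped(gang: GangContext, replica_id: str) -> None:
    gang.live_replica_ids.discard(replica_id)
    if gang.live_replica_ids:
        # Sibling replicas still run on the shared PG; leave it in place.
        return
    try:
        remove_placement_group(gang.placement_group)
    except Exception as e:
        # Log instead of swallowing, per the review feedback above.
        logger.warning(
            f"Failed to remove gang placement group "
            f"{gang.placement_group.id}: {e}"
        )
```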
Ready for a first pass while I keep adding tests.


Description
This PR introduces gang scheduling support to Ray Serve, enabling atomic scheduling of replica groups. Gang scheduling ensures that sets of replicas are scheduled together -- either all succeed or all fail -- which is critical for distributed serving patterns requiring tight coordination between replicas or across multiple deployments. This is a stepping stone to achieve DP group fault tolerance in Ray Serve LLM.
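For concreteness, here is a hedged usage sketch based on the options visible in this PR's diffs (`gang_size`, `gang_placement_strategy`, and a `gang_scheduling_config` field on `DeploymentConfig`); the exact decorator argument names and accepted value shapes are assumptions, not the finalized API:

```python
# Illustrative sketch only; argument names and shapes are inferred from the
# diffs in this PR rather than a confirmed public API.
from ray import serve


@serve.deployment(
    num_replicas=8,  # must be a multiple of gang_size per the new validator
    gang_scheduling_config={
        "gang_size": 4,
        "gang_placement_strategy": "PACK",  # assumed strategy value
    },
)
class ShardedModel:
    def __call__(self, request) -> str:
        return "ok"


app = ShardedModel.bind()
```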
Key decisions
Out of scope for this PR
Test approach
Basic validation
- `@serve.deployment` succeeds and responds to requests
- `.options` succeeds and responds to requests
- `gang_size` replicas

Failure validation
Placement strategy
Failover / Fault Tolerance
Related issues
RFC: https://docs.google.com/document/d/1IzLTRJo-B8YF74eAKJA6XuDqZtzGOzEtJ1WjkMzroq4/edit?pli=1&tab=t.0
Additional information